Wine Quality Reds Exploration by HanByul Yang

Summary of the data set

[1] 1599   12
 [1] "fixed.acidity"        "volatile.acidity"     "citric.acid"         
 [4] "residual.sugar"       "chlorides"            "free.sulfur.dioxide" 
 [7] "total.sulfur.dioxide" "density"              "pH"                  
[10] "sulphates"            "alcohol"              "quality"             
'data.frame':   1599 obs. of  12 variables:
 $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
 $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
 $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
 $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
 $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
 $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
 $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
 $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
 $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
 $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
 $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
 $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
 fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
 Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
 1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
 Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
 Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
 3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
 Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
   chlorides       free.sulfur.dioxide total.sulfur.dioxide
 Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
 1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
 Median :0.07900   Median :14.00       Median : 38.00      
 Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
 3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
 Max.   :0.61100   Max.   :72.00       Max.   :289.00      
    density             pH          sulphates         alcohol     
 Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
 1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
 Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
 Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
 3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
 Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
    quality     
 Min.   :3.000  
 1st Qu.:5.000  
 Median :6.000  
 Mean   :5.636  
 3rd Qu.:6.000  
 Max.   :8.000  
  1. The data set consists of 1599 red wines with 11 input attributes and 1 output attribute.
  2. Quality varies from 3 to 8, with a median of 6.
  3. Alcohol varies from 8.4% to 14.9%.
  4. Median residual.sugar is 2.2 g/dm^3 and median alcohol is 10.2%.

Univariate Plots Section

fixed.acidity

The middle 50% of red wines have fixed acidity between 7.10 g/dm^3 and 9.20 g/dm^3.

volatile.acidity

The middle 50% of red wines have volatile acidity between 0.39 g/dm^3 and 0.64 g/dm^3. There are some outliers above 1.4 g/dm^3, which I removed.

citric.acid

Citric acid has three peaks, around 0, 0.25 and 0.5 g/dm^3.


FALSE  TRUE 
 1467   132 

About 8% (132/1599) of red wines have no citric acid.
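The share can be recomputed from the TRUE/FALSE counts shown above. This is a minimal base R sketch; the name zero_citric is illustrative and not taken from the original analysis.

```r
# Counts copied from the logical table above (citric.acid == 0)
zero_citric <- c("FALSE" = 1467, "TRUE" = 132)

# Share of each group in percent, rounded to one decimal place
share <- round(100 * prop.table(zero_citric), 1)
print(share)
```

The TRUE entry is the percentage of wines with no citric acid.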

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.090   0.260   0.271   0.420   1.000 

Median citric.acid is 0.260 g/dm^3.

table of citric.acid


   0 0.01 0.02 0.03 0.04 0.05 0.06 0.07 0.08 0.09  0.1 0.11 0.12 0.13 0.14 
 132   33   50   30   29   20   24   22   33   30   35   15   27   18   21 
0.15 0.16 0.17 0.18 0.19  0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28 0.29 
  19    9   16   22   21   25   33   27   25   51   27   38   20   19   21 
 0.3 0.31 0.32 0.33 0.34 0.35 0.36 0.37 0.38 0.39  0.4 0.41 0.42 0.43 0.44 
  30   30   32   25   24   13   20   19   14   28   29   16   29   15   23 
0.45 0.46 0.47 0.48 0.49  0.5 0.51 0.52 0.53 0.54 0.55 0.56 0.57 0.58 0.59 
  22   19   18   23   68   20   13   17   14   13   12    8    9    9    8 
 0.6 0.61 0.62 0.63 0.64 0.65 0.66 0.67 0.68 0.69  0.7 0.71 0.72 0.73 0.74 
   9    2    1   10    9    7   14    2   11    4    2    1    1    3    4 
0.75 0.76 0.78 0.79    1 
   1    3    1    1    1 

There is one outlier at 1.0 g/dm^3.

Removed outliers above 0.8 g/dm^3 and adjusted the bin width for better visualization.

residual.sugar

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.900   1.900   2.200   2.539   2.600  15.500 

The histogram of residual sugar has one peak and a long right tail. The middle 50% of red wines have residual sugar between 1.9 g/dm^3 and 2.6 g/dm^3; the median is 2.2 g/dm^3 and the mean is 2.539 g/dm^3.

Transformed x-axis with log10() for better visualization.
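A rough base R sketch of why the log10 scale helps with a long-tailed variable. It uses the five-number summary above and only the first ten residual.sugar values shown in the str() output, so the histogram is illustrative rather than a reproduction of the original plot. The names five_num, residual_sugar and h are assumptions.

```r
# Five-number summary copied from the residual.sugar table above
five_num <- c(min = 0.9, q1 = 1.9, median = 2.2, q3 = 2.6, max = 15.5)

# On the log10 scale the maximum sits much closer to the bulk of the data
round(log10(five_num), 2)

# First ten residual.sugar values, copied from the str() output in the summary
residual_sugar <- c(1.9, 2.6, 2.3, 1.9, 1.9, 1.8, 1.6, 1.2, 2, 6.1)

# Histogram of the log10-transformed values (base graphics)
h <- hist(log10(residual_sugar),
          main = "residual.sugar (log10 scale)",
          xlab = "log10(residual.sugar)")
```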

chlorides

Chlorides seem to have some outliers above 0.4 g/dm^3.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
0.01200 0.07000 0.07900 0.08747 0.09000 0.61100 

The middle 50% of red wines have chlorides between 0.07 g/dm^3 and 0.09 g/dm^3; the median is 0.079 g/dm^3 and the mean is 0.08747 g/dm^3.

Removed outliers above 0.3 g/dm^3 and adjusted the bin width for better visualization.

free.sulfur.dioxide

There seem to be some outliers in the histogram of free.sulfur.dioxide.

table of free.sulfur.dioxide


   1    2    3    4    5  5.5    6    7    8    9   10   11   12   13   14 
   3    1   49   41  104    1  138   71   56   62   79   59   75   57   50 
  15   16   17   18   19   20   21   22   23   24   25   26   27   28   29 
  78   61   60   46   39   30   41   22   32   34   24   32   29   23   23 
  30   31   32   33   34   35   36   37 37.5   38   39   40 40.5   41   42 
  16   20   22   11   18   15   11    3    2    9    5    6    1    7    3 
  43   45   46   47   48   50   51   52   53   54   55   57   66   68   72 
   3    3    1    1    4    2    4    3    1    1    2    1    1    2    1 

Almost all free.sulfur.dioxide values are integers. The exceptions are four observations at three values: 5.5, 37.5 and 40.5.
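The integer check can be sketched from the table itself. The snippet below uses a handful of value/count pairs copied from the table (not the full data set), and the variable names are illustrative.

```r
# A few value/count pairs copied from the free.sulfur.dioxide table above
vals   <- c(5, 5.5, 6, 37, 37.5, 38, 40, 40.5, 41)
counts <- c(104, 1, 138, 3, 2, 9, 6, 1, 7)

# A value is non-integer when its remainder modulo 1 is not zero
non_integer <- vals %% 1 != 0

vals[non_integer]         # distinct non-integer values
sum(counts[non_integer])  # number of wines carrying them
```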

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   1.00    7.00   14.00   15.87   21.00   72.00 

The middle 50% of free.sulfur.dioxide values lie between 7 mg/dm^3 and 21 mg/dm^3.

Removed outliers above 60 mg/dm^3.

total.sulfur.dioxide

There are 2 outliers above 170 mg/dm^3 on histogram of total.sulfur.dioxide.

table of total.sulfur.dioxide


   6    7    8    9   10   11   12   13   14   15   16   17   18   19   20 
   3    4   14   14   27   26   29   28   33   35   26   27   35   29   33 
  21   22   23   24   25   26   27   28   29   30   31   32   33   34   35 
  25   25   34   36   27   24   30   43   20   14   32   20   17   20   26 
  36   37   38   39   40   41   42   43   44   45   46   47   48   49   50 
  12   26   31   16   17   14   26   18   23   20   17   24   21   21   11 
  51   52   53   54   55   56   57   58   59   60   61   62   63   64   65 
  11   15   14   20   13   10    6   14    9   18    9    9   13   10   17 
  66   67   68   69   70   71   72   73   74   75   76   77 77.5   78   79 
   9   12   10    8    8    7   10    7    8    5    3    8    2    4    5 
  80   81   82   83   84   85   86   87   88   89   90   91   92   93   94 
   4    6    4    2    6    9   10    6   14    9    5    7    8    2    8 
  95   96   98   99  100  101  102  103  104  105  106  108  109  110  111 
   4    5    7    6    3    4    6    2    5    5    6    3    4    6    3 
 112  113  114  115  116  119  120  121  122  124  125  126  127  128  129 
   3    4    2    2    1    7    2    4    3    3    2    1    2    2    3 
 130  131  133  134  135  136  139  140  141  142  143  144  145  147  148 
   1    3    3    2    2    2    1    1    3    1    2    3    3    3    2 
 149  151  152  153  155  160  165  278  289 
   1    2    1    1    1    1    1    1    1 

The table shows that all total.sulfur.dioxide values are integers, except for two observations at 77.5.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   6.00   22.00   38.00   46.47   62.00  289.00 

The middle 50% of red wines have total.sulfur.dioxide between 22 mg/dm^3 and 62 mg/dm^3. The median is 38 mg/dm^3.

I removed outliers above 170 mg/dm^3.

density

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.9901  0.9956  0.9968  0.9967  0.9978  1.0040 

The density variable appears to be roughly normally distributed, with most values between 0.995 and 1.0.

pH

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  2.740   3.210   3.310   3.311   3.400   4.010 

The pH also seems to have a roughly normal distribution. The middle 50% of red wines have a pH between 3.21 and 3.4; the median is 3.31 and the mean is 3.311.

sulphates

Sulphates has outliers above 1.5 g/dm^3 and a peak around 0.6 g/dm^3.

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 0.3300  0.5500  0.6200  0.6581  0.7300  2.0000 

Median of sulphates is 0.62 g/dm^3.

Values above 1.4 g/dm^3 were ignored as outliers for better visualization.

alcohol

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   8.40    9.50   10.20   10.42   11.10   14.90 

The alcohol content varies between 8.4% and 14.9%, with major peaks around 10%. The middle 50% of red wines have alcohol between 9.5% and 11.1%; the median is 10.2% and the mean is 10.42%.

quality

   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  3.000   5.000   6.000   5.636   6.000   8.000 

  5   6   7   4   8   3 
681 638 199  53  18  10 

All quality values are integers between 3 and 8. Most red wines have a quality of 5 or 6; the median is 6 and the mean is 5.636.

quality_class

   low medium   high 
    10   1372    217 

I created “quality_class” for simple classification. It has three levels of quality.

  1. low (quality <= 3, i.e. 0, 1, 2, 3)
  2. medium (4 <= quality <= 6, i.e. 4, 5, 6)
  3. high (7 <= quality, i.e. 7, 8, 9, 10)

85.8% (1372 / 1599) are medium quality.
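One way such a three-level variable could be built is with cut(). This is a sketch, not necessarily the code used in the original analysis. To stay self-contained, the quality vector is rebuilt from the counts in the quality table above.

```r
# Rebuild the quality vector from the counts in the quality table above
quality <- rep(c(5, 6, 7, 4, 8, 3), times = c(681, 638, 199, 53, 18, 10))

# Bin into three ordered classes: (-Inf, 3], (3, 6], (6, Inf]
quality_class <- cut(quality,
                     breaks = c(-Inf, 3, 6, Inf),
                     labels = c("low", "medium", "high"),
                     ordered_result = TRUE)

class_counts <- table(quality_class)
class_counts
round(100 * prop.table(class_counts), 1)
```

The counts reproduce the low/medium/high table above, and the medium share matches the stated 85.8%.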

Univariate Analysis

What is the structure of your dataset?

There are 1599 red wines and 13 variables: 11 input features and 2 output features (quality and quality_class). 12 variables come from the CSV file; I added 1 variable for the analysis.

What is/are the main feature(s) of interest in your dataset?

The main feature of interest in the data set is quality. I’d like to find which chemical properties influence the quality of red wine. I suspect alcohol is strongly related to quality, since red wine is a kind of liquor.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

I think flavor or taste may be strongly related to quality, so acid-related variables such as pH or citric.acid, together with residual.sugar, will help the investigation.

Did you create any new variables from existing variables in the dataset?

I created the ‘quality_class’ variable from the existing variable ‘quality’. It categorizes quality into three levels: low, medium and high. About 86% of the wines are medium quality.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

Citric acid has three peaks, around 0, 0.25 and 0.5 g/dm^3. 132 red wines (about 8%) have no citric.acid.

The ‘quality’ variable is just a numerical value. For a simple classification of quality, I divided it into 3 levels and made a new variable, ‘quality_class’.

Bivariate Plots Section

                     fixed.acidity volatile.acidity citric.acid
fixed.acidity           1.00000000     -0.256130895  0.67170343
volatile.acidity       -0.25613089      1.000000000 -0.55249568
citric.acid             0.67170343     -0.552495685  1.00000000
residual.sugar          0.11477672      0.001917882  0.14357716
chlorides               0.09370519      0.061297772  0.20382291
free.sulfur.dioxide    -0.15379419     -0.010503827 -0.06097813
total.sulfur.dioxide   -0.11318144      0.076470005  0.03553302
density                 0.66804729      0.022026232  0.36494718
pH                     -0.68297819      0.234937294 -0.54190414
sulphates               0.18300566     -0.260986685  0.31277004
alcohol                -0.06166827     -0.202288027  0.10990325
quality                 0.12405165     -0.390557780  0.22637251
                     residual.sugar    chlorides free.sulfur.dioxide
fixed.acidity           0.114776724  0.093705186        -0.153794193
volatile.acidity        0.001917882  0.061297772        -0.010503827
citric.acid             0.143577162  0.203822914        -0.060978129
residual.sugar          1.000000000  0.055609535         0.187048995
chlorides               0.055609535  1.000000000         0.005562147
free.sulfur.dioxide     0.187048995  0.005562147         1.000000000
total.sulfur.dioxide    0.203027882  0.047400468         0.667666450
density                 0.355283371  0.200632327        -0.021945831
pH                     -0.085652422 -0.265026131         0.070377499
sulphates               0.005527121  0.371260481         0.051657572
alcohol                 0.042075437 -0.221140545        -0.069408354
quality                 0.013731637 -0.128906560        -0.050656057
                     total.sulfur.dioxide     density          pH
fixed.acidity                 -0.11318144  0.66804729 -0.68297819
volatile.acidity               0.07647000  0.02202623  0.23493729
citric.acid                    0.03553302  0.36494718 -0.54190414
residual.sugar                 0.20302788  0.35528337 -0.08565242
chlorides                      0.04740047  0.20063233 -0.26502613
free.sulfur.dioxide            0.66766645 -0.02194583  0.07037750
total.sulfur.dioxide           1.00000000  0.07126948 -0.06649456
density                        0.07126948  1.00000000 -0.34169933
pH                            -0.06649456 -0.34169933  1.00000000
sulphates                      0.04294684  0.14850641 -0.19664760
alcohol                       -0.20565394 -0.49617977  0.20563251
quality                       -0.18510029 -0.17491923 -0.05773139
                        sulphates     alcohol     quality
fixed.acidity         0.183005664 -0.06166827  0.12405165
volatile.acidity     -0.260986685 -0.20228803 -0.39055778
citric.acid           0.312770044  0.10990325  0.22637251
residual.sugar        0.005527121  0.04207544  0.01373164
chlorides             0.371260481 -0.22114054 -0.12890656
free.sulfur.dioxide   0.051657572 -0.06940835 -0.05065606
total.sulfur.dioxide  0.042946836 -0.20565394 -0.18510029
density               0.148506412 -0.49617977 -0.17491923
pH                   -0.196647602  0.20563251 -0.05773139
sulphates             1.000000000  0.09359475  0.25139708
alcohol               0.093594750  1.00000000  0.47616632
quality               0.251397079  0.47616632  1.00000000

Alcohol and sulphates are the features most positively correlated with quality. volatile.acidity is the most negatively correlated with quality, and in absolute terms it ranks second after alcohol.

First, I will look into scatterplots involving quality and the most highly correlated variables: alcohol, sulphates and volatile.acidity.
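The ranking can be read off the quality column of the correlation matrix. The sketch below copies that column into a named vector (cor_quality is an illustrative name) and sorts by absolute value, so that negative correlations are ranked alongside positive ones.

```r
# Correlations with quality, copied from the last column of the matrix above
cor_quality <- c(
  fixed.acidity        =  0.12405165,
  volatile.acidity     = -0.39055778,
  citric.acid          =  0.22637251,
  residual.sugar       =  0.01373164,
  chlorides            = -0.12890656,
  free.sulfur.dioxide  = -0.05065606,
  total.sulfur.dioxide = -0.18510029,
  density              = -0.17491923,
  pH                   = -0.05773139,
  sulphates            =  0.25139708,
  alcohol              =  0.47616632
)

# Order by absolute strength, strongest first
ranked <- cor_quality[order(abs(cor_quality), decreasing = TRUE)]
round(head(ranked, 4), 3)
```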

alcohol and quality

Since the scatterplot is overplotted, I used jitter and alpha for a better visual.

I used the factored variable ‘quality’ for the boxplot. The boxplot shows that the median alcohol value tends to rise with quality.

With the density plot, we can see the positive correlation between alcohol and quality.

The plot of the linear model shows a positive correlation between alcohol and quality.


Call:
lm(formula = quality ~ alcohol, data = wineSubset)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.8489 -0.4065 -0.1787  0.5176  2.5909 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.81782    0.17512   10.38   <2e-16 ***
alcohol      0.36646    0.01672   21.92   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7083 on 1596 degrees of freedom
Multiple R-squared:  0.2314,    Adjusted R-squared:  0.2309 
F-statistic: 480.4 on 1 and 1596 DF,  p-value: < 2.2e-16

The linear model of alcohol and quality has an R-squared value of 0.2314. ‘wineSubset’ is the subset of the original data set without the alcohol outliers above the 99.9th percentile.
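A percentile-based subset of this kind could be built as sketched below. The exact comparison used in the original analysis is not shown, so the strict "<" is an assumption, and the data frame wine_head holds only the first ten rows from the str() output.

```r
# First ten rows of the data set, copied from the str() output in the summary
wine_head <- data.frame(
  alcohol = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Keep rows strictly below the 99.9th percentile of alcohol
# (assumption: the original filter may have used a different comparison)
cutoff <- quantile(wine_head$alcohol, probs = 0.999)
wine_subset <- subset(wine_head, alcohol < cutoff)

nrow(wine_head) - nrow(wine_subset)  # number of rows dropped
```

On the full data set the same idea would leave 1598 of the 1599 rows, which is consistent with the 1596 residual degrees of freedom reported above.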

sulphates and quality

Used jitter and alpha for better visualization. There are some outliers.


Call:
lm(formula = quality ~ sulphates, data = wine_sulphates_Subset)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.02595 -0.51097 -0.02595  0.47064  2.39707 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  4.44423    0.09018   49.28   <2e-16 ***
sulphates    1.83920    0.13573   13.55   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7653 on 1581 degrees of freedom
Multiple R-squared:  0.1041,    Adjusted R-squared:  0.1035 
F-statistic: 183.6 on 1 and 1581 DF,  p-value: < 2.2e-16

‘sulphates’ is the second most positively correlated variable with quality. After removing outliers above the 99th percentile, the linear model has an R-squared value of 0.1041.

volatile.acidity and quality

As in the plot of sulphates and quality, I used jitter and alpha. There seems to be a negative correlation between volatile.acidity and quality.


Call:
lm(formula = quality ~ volatile.acidity, data = wine_v_acidity_Subset)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.78977 -0.54547 -0.01325  0.47198  2.92568 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       6.55757    0.05841  112.27   <2e-16 ***
volatile.acidity -1.74500    0.10503  -16.61   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7436 on 1596 degrees of freedom
Multiple R-squared:  0.1474,    Adjusted R-squared:  0.1469 
F-statistic:   276 on 1 and 1596 DF,  p-value: < 2.2e-16

‘volatile.acidity’ is the variable most negatively correlated with quality. The linear model has an R-squared value of 0.1474 after removing outliers above the 99.9th percentile.

Flavor and quality

I’ll investigate flavor-related variables and quality.

  1. residual.sugar for sweetness
  2. chlorides for saltiness
  3. citric.acid for freshness

residual.sugar and quality

To see the trend of the median, I used a boxplot. There seems to be no correlation between residual.sugar and quality.


Call:
lm(formula = quality ~ residual.sugar, data = wine_r.sugar_subset)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.6743 -0.6319  0.3560  0.3717  2.3778 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     5.60528    0.05277 106.220   <2e-16 ***
residual.sugar  0.01211    0.01993   0.608    0.544    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.809 on 1581 degrees of freedom
Multiple R-squared:  0.0002335, Adjusted R-squared:  -0.0003989 
F-statistic: 0.3692 on 1 and 1581 DF,  p-value: 0.5435

The R-squared value of the linear model is 0.0002335, so sweetness appears to have no effect on quality.

chlorides and quality

As above, I used a boxplot. There seems to be a negative correlation between chlorides and quality.


Call:
lm(formula = quality ~ chlorides, data = wine_chlorides_subset)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7136 -0.6505  0.2800  0.3653  2.3653 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.90638    0.05881 100.430  < 2e-16 ***
chlorides   -3.15950    0.65812  -4.801 1.73e-06 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.803 on 1581 degrees of freedom
Multiple R-squared:  0.01437,   Adjusted R-squared:  0.01375 
F-statistic: 23.05 on 1 and 1581 DF,  p-value: 1.73e-06

The R-squared value is 0.01437. It is larger than for residual.sugar but still low. ‘chlorides’ indicates the amount of salt in wine, so salty flavor is also only weakly related to quality.

citric.acid and quality

The plot seems to show a positive correlation between citric.acid and quality.


Call:
lm(formula = quality ~ citric.acid, data = wine_c.acid_subset)

Residuals:
     Min       1Q   Median       3Q      Max 
-3.01809 -0.59820  0.09909  0.50922  2.59711 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  5.37360    0.03371 159.384   <2e-16 ***
citric.acid  0.97651    0.10144   9.627   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7847 on 1595 degrees of freedom
Multiple R-squared:  0.05491,   Adjusted R-squared:  0.05432 
F-statistic: 92.68 on 1 and 1595 DF,  p-value: < 2.2e-16

The R-squared value of the linear model between citric.acid and quality is only 0.055, but it is much larger than those for chlorides and residual.sugar.

Freshness is the most important of the three flavors mentioned above.

similar variables investigation

This part is not about finding variables that determine quality. Some variables are related to each other, such as the sulfur dioxide family and the acidity family.

free.sulfur.dioxide and total.sulfur.dioxide


Call:
lm(formula = total.sulfur.dioxide ~ free.sulfur.dioxide, data = wineQuality)

Residuals:
    Min      1Q  Median      3Q     Max 
-55.120 -13.534  -7.325   7.570 197.126 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)         13.13535    1.11367   11.79   <2e-16 ***
free.sulfur.dioxide  2.09969    0.05858   35.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 24.5 on 1597 degrees of freedom
Multiple R-squared:  0.4458,    Adjusted R-squared:  0.4454 
F-statistic:  1285 on 1 and 1597 DF,  p-value: < 2.2e-16

total.sulfur.dioxide seems to be fairly strongly correlated with free.sulfur.dioxide. The R-squared value of the linear model is 0.4458.

volatile.acidity and pH


Call:
lm(formula = pH ~ volatile.acidity, data = wine_v_acidity_Subset)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.56954 -0.09889  0.00115  0.09310  0.65578 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       3.20394    0.01179 271.649   <2e-16 ***
volatile.acidity  0.20307    0.02121   9.575   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1502 on 1596 degrees of freedom
Multiple R-squared:  0.05432,   Adjusted R-squared:  0.05373 
F-statistic: 91.68 on 1 and 1596 DF,  p-value: < 2.2e-16

The R-squared value is relatively low at 0.05432. pH is only weakly related to volatile.acidity.

fixed.acidity and pH


Call:
lm(formula = pH ~ fixed.acidity, data = wine_f.acid_subset)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.51754 -0.06571  0.00170  0.06486  0.52156 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)    3.81657    0.01385  275.64   <2e-16 ***
fixed.acidity -0.06076    0.00163  -37.27   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1128 on 1596 degrees of freedom
Multiple R-squared:  0.4654,    Adjusted R-squared:  0.465 
F-statistic:  1389 on 1 and 1596 DF,  p-value: < 2.2e-16

Based on the R-squared value, fixed.acidity can explain about 46.5% of the variance in pH. Since the median fixed.acidity (7.9 g/dm^3) is much larger than the median volatile.acidity (0.52 g/dm^3), fixed.acidity is the main influence on pH.

citric.acid and pH


Call:
lm(formula = pH ~ citric.acid, data = wineQuality)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.50025 -0.07733 -0.00570  0.08251  0.58251 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.427491   0.005562  616.25   <2e-16 ***
citric.acid -0.429477   0.016668  -25.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.1298 on 1597 degrees of freedom
Multiple R-squared:  0.2937,    Adjusted R-squared:  0.2932 
F-statistic:   664 on 1 and 1597 DF,  p-value: < 2.2e-16

The R-squared value is 0.2937, which is larger than that of the volatile.acidity linear model, even though the median citric.acid (0.26 g/dm^3) is smaller than the median volatile.acidity (0.52 g/dm^3).

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Based on the correlation values, I investigated the relationship between quality and other variables. Alcohol is the variable most strongly correlated with the quality of red wine.

In terms of flavor, no variable seems to be highly related to quality. In particular, salty (chlorides) and sweet (residual.sugar) tastes have almost no influence on the quality of wine.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

I observed chemically similar variables. As expected, free.sulfur.dioxide and total.sulfur.dioxide are highly related. (“Total” includes “free”.)

The relationship between the acid variables and pH is also interesting. As expected, higher acidity goes with lower pH, except for volatile.acidity. The linear model between volatile.acidity and pH has a low R-squared value of 0.05432. Fixed acidity is the dominant influence on pH, with an R-squared value of 0.4654, as we expected from the median values of the acid variables.

What was the strongest relationship you found?

The relationship between fixed.acidity and pH is the strongest. Its correlation is -0.68 and the R-squared value of the linear model is 0.4654.
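For a linear model with a single predictor, R-squared equals the squared Pearson correlation, which gives a quick consistency check. The sketch below uses the full-data correlation from the matrix above. The reported 0.4654 comes from wine_f.acid_subset, which has outliers removed, so a small difference is expected.

```r
# Pearson correlation between fixed.acidity and pH, from the matrix above
r <- -0.68297819

# In a one-predictor linear model, R-squared equals the squared correlation
r_squared <- r^2
round(r_squared, 4)
```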

Multivariate Plots Section

First, I’d like to investigate based on correlation.

alcohol, volatile.acidity and quality

I applied factor() to quality and used alpha for better visualization. There seems to be a weak negative correlation between alcohol and volatile.acidity.

To investigate each quality level more closely, I used facet_wrap(). It is easy to see that the lowest (3) and highest (8) quality wines are distributed differently on the scatterplot.


Call:
lm(formula = quality ~ volatile.acidity/alcohol, data = wineQuality)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.65872 -0.41597 -0.03492  0.47365  2.12949 

Coefficients:
                         Estimate Std. Error t value Pr(>|t|)    
(Intercept)               6.43388    0.05333  120.65   <2e-16 ***
volatile.acidity         -7.26558    0.32048  -22.67   <2e-16 ***
volatile.acidity:alcohol  0.55595    0.03092   17.98   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6784 on 1596 degrees of freedom
Multiple R-squared:  0.2953,    Adjusted R-squared:  0.2944 
F-statistic: 334.3 on 2 and 1596 DF,  p-value: < 2.2e-16

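In this model, quality is regressed on volatile.acidity and the volatile.acidity:alcohol interaction term, which is how R expands volatile.acidity/alcohol in a formula (it is not a literal ratio). The volatile.acidity coefficient is negative, and the model has an R-squared value of 0.2953. This is higher than the R-squared value of the linear model of alcohol and quality.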
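The difference between the formula operator "/" and an arithmetic ratio can be checked without fitting anything, by asking R for the terms a formula expands to. This is a small sketch; wrapping the ratio in I() is one standard way to request the literal quotient and is not something the original analysis did.

```r
# Expand each formula into its model terms without fitting a model
nested <- attr(terms(quality ~ volatile.acidity / alcohol), "term.labels")
ratio  <- attr(terms(quality ~ I(volatile.acidity / alcohol)), "term.labels")

nested  # main effect plus interaction: "/" is the nesting operator
ratio   # a single term: the arithmetic ratio, protected by I()
```

The first result matches the two coefficient rows in the lm() summary above.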

alcohol, sulphates and quality

As in the plots of alcohol, volatile.acidity and quality, I applied factor() to quality and used alpha for better visualization. There seems to be a positive correlation between alcohol and sulphates.

Similar to the plot of alcohol, volatile.acidity and quality, the lowest (3) and highest (8) quality wines are distributed far apart on the scatterplot.


Call:
lm(formula = quality ~ sulphates/alcohol, data = wineQuality)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.7909 -0.3643 -0.1458  0.5097  2.4503 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        4.86396    0.06846   71.05   <2e-16 ***
sulphates         -4.25244    0.26375  -16.12   <2e-16 ***
sulphates:alcohol  0.51926    0.02322   22.36   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.6825 on 1596 degrees of freedom
Multiple R-squared:  0.2866,    Adjusted R-squared:  0.2857 
F-statistic: 320.6 on 2 and 1596 DF,  p-value: < 2.2e-16

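In this model, quality is regressed on sulphates and the sulphates:alcohol interaction term (R’s expansion of sulphates/alcohol in a formula). The interaction coefficient is positive and the R-squared value of the linear model is 0.2866. It is also higher than the R-squared value of the alcohol-only linear model.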

volatile.acidity, sulphates and quality

Used alpha and factor() on quality for better visualization. There is a negative correlation between volatile.acidity and sulphates.

The per-quality scatterplots show no noticeable pattern.


Call:
lm(formula = quality ~ sulphates/volatile.acidity, data = wineQuality)

Residuals:
     Min       1Q   Median       3Q      Max 
-2.94529 -0.50157 -0.03746  0.48209  2.89781 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 5.13683    0.07624   67.37   <2e-16 ***
sulphates                   1.99426    0.12124   16.45   <2e-16 ***
sulphates:volatile.acidity -2.39588    0.16352  -14.65   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.7343 on 1596 degrees of freedom
Multiple R-squared:  0.1743,    Adjusted R-squared:  0.1732 
F-statistic: 168.4 on 2 and 1596 DF,  p-value: < 2.2e-16

The sulphates/volatile.acidity model has a positive sulphates coefficient and a negative sulphates:volatile.acidity interaction coefficient. Its R-squared value is 0.1743.

similar variables investigation

I’d like to continue from the bivariate plots by adding the quality variable. First, the sulfur dioxide family (free.sulfur.dioxide and total.sulfur.dioxide).

sulfur dioxide family

The sulfur dioxide family is positively correlated. There is no significant difference between the distributions for each quality.

acidity family

“pH” is related to acidity. I’ll check the relationship between pH and the acid-related variables: volatile.acidity, fixed.acidity and citric.acid.

pH is positively correlated with volatile.acidity.

Used facet_wrap() for better visualization of each quality.

pH is negatively correlated with fixed.acidity.

Each quality level has almost the same distribution on the fixed.acidity and pH plot.

pH is negatively correlated with citric.acid.

There is no noticeable difference among the quality levels.

pH is positively, though weakly, correlated with volatile.acidity, and negatively correlated with fixed.acidity and citric.acid. There is no noticeable difference among the quality levels.

multivariate linear model


Calls:
m1: lm(formula = quality ~ alcohol, data = wineSubset)
m2: lm(formula = quality ~ alcohol + volatile.acidity, data = wineSubset)
m3: lm(formula = quality ~ alcohol + volatile.acidity + sulphates, 
    data = wineSubset)

===============================================
                     m1        m2        m3    
-----------------------------------------------
(Intercept)        1.818***  3.038***  2.547***
                  (0.175)   (0.185)   (0.196)  
alcohol            0.366***  0.319***  0.315***
                  (0.017)   (0.016)   (0.016)  
volatile.acidity            -1.384*** -1.221***
                            (0.095)   (0.097)  
sulphates                              0.685***
                                      (0.100)  
-----------------------------------------------
R-squared             0.231     0.322     0.341
adj. R-squared        0.231     0.321     0.340
sigma                 0.708     0.666     0.656
F                   480.388   378.330   274.938
p                     0.000     0.000     0.000
Log-likelihood    -1715.379 -1615.409 -1592.411
Deviance            800.741   706.568   686.521
AIC                3436.757  3238.818  3194.823
BIC                3452.887  3260.324  3221.705
N                  1598      1598      1598    
===============================================

With the three variables most highly correlated with quality (alcohol, volatile.acidity and sulphates), I built a linear model for quality. Its R-squared value is 0.341.
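Because R-squared never decreases when predictors are added, the adjusted R-squared row is the fairer comparison across m1, m2 and m3. The sketch below recomputes it from the rounded R-squared values and N in the table; adj_r_squared is an illustrative helper, not part of the original analysis.

```r
# Adjusted R-squared for n observations and p predictors
adj_r_squared <- function(r2, n, p) {
  1 - (1 - r2) * (n - 1) / (n - p - 1)
}

# R-squared values, N and predictor counts taken from the model table above
adjusted <- round(adj_r_squared(c(m1 = 0.231, m2 = 0.322, m3 = 0.341),
                                n = 1598, p = 1:3), 3)
adjusted
```

With 1598 observations the penalty is tiny, so the adjusted values stay within 0.001 of the raw ones and the improvement from m1 to m3 is not an artifact of model size.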

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The relationship between quality and alcohol is strengthened by adding other variables that are highly correlated with quality.

For the quality ~ volatile.acidity/alcohol model, the R-squared value is 0.2953.

For the quality ~ sulphates/alcohol model, the R-squared value is 0.2866.

Both are higher than the R-squared value (0.2314) of the linear model of quality and alcohol.

Were there any interesting or surprising interactions between features?

The relationship between volatile.acidity and alcohol shows an interesting result when the categorical quality variable is used. The lowest (3) and highest (8) quality wines are distributed differently on the scatterplot.

The relationship among sulphates, alcohol and quality is interesting in the same way. The lowest (3) and highest (8) quality wines are distributed far apart.


Final Plots and Summary

Plot One

Description One

The variable most highly correlated with quality is alcohol. The linear model of quality and alcohol has an R-squared value of 0.2314. To improve the R-squared value, I added volatile.acidity and sulphates. The linear model of quality with three variables has an R-squared value of 0.341.

Plot Two

Description Two

I considered three flavors in red wine: sweetness, saltiness and freshness. Based on the R-squared values, freshness (citric.acid) is the most important flavor for quality. Sweetness is not related to the quality of red wine. The flavors are listed below in order of importance, with their R-squared values.

  1. citric.acid (R-squared value 0.055)
  2. chlorides (R-squared value 0.01437)
  3. residual.sugar (R-squared value 0.0002335)

Plot Three

Description Three

Using the relationships of volatile.acidity and sulphates with alcohol, it can be seen that the highest and lowest quality red wines have noticeably different distributions. A low ratio of volatile.acidity to alcohol indicates high quality red wine, and a high ratio of sulphates to alcohol indicates high quality red wine.

For the ratio of volatile.acidity to alcohol, quality 3 wine has a median of 0.084, which is higher than the median of 0.032 for quality 8 wine. For the ratio of sulphates to alcohol, the median for quality 3 wine is 0.053, which is lower than the median of 0.059 for quality 8 wine.
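Per-quality medians of this kind can be computed with tapply(). The sketch below runs on only the first ten rows from the str() output (quality 5, 6 and 7), so it illustrates the computation and does not reproduce the quality 3 and quality 8 medians quoted above. The names wine_head, va_ratio and sul_ratio are illustrative.

```r
# First ten rows of the data set, copied from the str() output in the summary
wine_head <- data.frame(
  volatile.acidity = c(0.7, 0.88, 0.76, 0.28, 0.7, 0.66, 0.6, 0.65, 0.58, 0.5),
  sulphates        = c(0.56, 0.68, 0.65, 0.58, 0.56, 0.56, 0.46, 0.47, 0.57, 0.8),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 9.4, 9.4, 10, 9.5, 10.5),
  quality          = c(5, 5, 5, 6, 5, 5, 5, 7, 7, 5)
)

# Median of each ratio within every quality score
va_ratio  <- with(wine_head, tapply(volatile.acidity / alcohol, quality, median))
sul_ratio <- with(wine_head, tapply(sulphates / alcohol, quality, median))

round(rbind(va_ratio, sul_ratio), 3)
```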


Reflection

The data set contains 1599 red variants of the Portuguese “Vinho Verde” wine. I started by understanding the individual variables in the data set, and I was interested in the “alcohol” feature because wine is a kind of liquor.

Since the data set is tidy, I didn’t need to clean or filter it. However, all variables are numerical, which makes it hard to produce bivariate or multivariate plots. So I used the ‘quality’ variable on a discrete scale, or applied factor() for categorical plots. Even with factored quality, it was hard to recognize differences in distribution by quality in a single plot, so I made plots for each quality or filtered for some interesting quality levels.

I presumed sweetness would be highly related to the quality of red wine. But surprisingly, the most important flavor of red wine is freshness from citric acid, while sweetness is the least important flavor.

As I expected, the feature most correlated with quality is “alcohol”, and other features are also related to quality: “sulphates” is positively correlated and “volatile.acidity” is negatively correlated. The linear model with only the “alcohol” variable has an R-squared value of 0.231. Adding “volatile.acidity” and “sulphates” increases the R-squared value to 0.341.

Since the data set consists of samples of the specific red wine mentioned above, this analysis is limited. It might be interesting to obtain data sets from various regions to reduce any bias from relying on a single product.